Latency is the time delay for a single operation to complete, while throughput is the number of operations a system can process per unit of time; the two are often in tension, and optimizing one frequently comes at the expense of the other.
Latency and throughput are two fundamental performance metrics that together describe a system's responsiveness and capacity. Latency measures the time between initiating a request and receiving the response: how fast the system responds to a single user. Throughput measures how many requests the system can complete per unit of time: the system's capacity. While related, they are distinct dimensions, and optimizing one often trades off against the other. Understanding this tension is critical for designing systems that meet specific performance goals.
Definition: Latency is a measure of speed (time per operation). Throughput is a measure of capacity (operations per time).
Unit: Latency is measured in units of time (milliseconds, seconds). Throughput is measured in operations per time unit (requests/second, transactions/minute).
User Perspective: Latency directly impacts user experience (how long a page takes to load). Throughput impacts system capacity (how many users can be served simultaneously).
Optimization Trade-off: Batching operations improves throughput but increases latency for the first item. Caching reduces latency but costs memory and risks serving stale data.
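The definitions above can be reduced to simple arithmetic. A minimal sketch (the 50ms figure is illustrative, not measured): for a single serial worker, throughput is the reciprocal of latency, and adding workers multiplies capacity without making any individual operation faster.

```python
def max_throughput(latency_s: float, workers: int = 1) -> float:
    """Upper bound on operations/second for `workers` parallel workers,
    each completing one operation every `latency_s` seconds."""
    return workers / latency_s

print(max_throughput(0.050))      # 50 ms latency, one worker:  20.0 ops/s
print(max_throughput(0.050, 10))  # same latency, ten workers: 200.0 ops/s
```

Note that the ten-worker system is no faster for any single user; it simply serves ten of them at once.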
Consider an API server processing user requests. The server might have a latency of 50ms per request when idle. At 10 concurrent requests, latency might remain 50ms, yielding throughput of 200 requests/second (10 requests / 0.05s). As concurrency increases to 100, the system may reach saturation. With 200 concurrent requests, each request now waits in a queue, causing latency to spike to 500ms. Throughput, however, might peak at 400 requests/second, even though latency has increased 10x. This illustrates the queuing effect: beyond optimal concurrency, adding more load increases latency without increasing throughput.
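The queuing effect described above can be sketched as a toy closed-loop model. The 50ms service time matches the example; the server's parallel capacity of 20 in-flight requests is an assumption chosen so the numbers line up with the prose (400 requests/second peak, 500ms latency at 200 concurrent).

```python
SERVICE_TIME = 0.050   # seconds per request, from the example
CAPACITY = 20          # requests the server can process in parallel (assumed)

def observed(concurrency: int) -> tuple[float, float]:
    """Return (latency_s, throughput_rps) at a given concurrency level."""
    if concurrency <= CAPACITY:
        # Below saturation: no queueing, every request is served immediately.
        return SERVICE_TIME, concurrency / SERVICE_TIME
    # Saturated: throughput is capped at capacity; excess requests wait in a
    # queue, so latency grows linearly with offered concurrency.
    max_rps = CAPACITY / SERVICE_TIME
    return concurrency / max_rps, max_rps

for n in (10, 20, 100, 200):
    lat, rps = observed(n)
    print(f"{n:>3} concurrent: latency {lat * 1000:6.1f} ms, "
          f"throughput {rps:5.0f} req/s")
```

Running this reproduces the pattern in the example: throughput climbs until capacity, then flattens at 400 req/s while latency keeps rising with load.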
Database Batch Inserts: Inserting 10,000 records one by one: latency = 10ms per insert, throughput = 100 inserts/second. Batching 1,000 records at once: latency = 500ms per batch, throughput = 2,000 records/second. Throughput improved 20x, but latency for the first record increased 50x, since no record completes until its whole batch does.
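A minimal sketch of the batching technique using Python's built-in sqlite3 module. The table name and schema here are invented for illustration, and real per-insert costs depend on the database, network round trips, and transaction settings rather than the example's 10ms/500ms figures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")
rows = [(i, f"event-{i}") for i in range(10_000)]

# Row by row: one statement (and, over a network, one round trip) per record:
# for row in rows:
#     conn.execute("INSERT INTO events VALUES (?, ?)", row)

# Batched: a single executemany call amortizes per-statement overhead,
# raising throughput at the cost of the first record's completion time.
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # 10000
```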
CDN vs Direct Server: Direct server request: latency = 200ms, throughput = 1,000 requests/second. CDN with caching: latency = 50ms (closer to user), throughput = 10,000 requests/second (offloaded from origin). Both metrics improved.
Compression Trade-off: Sending 1MB uncompressed: latency = 100ms network transfer, throughput = 10MB/s. Compressing it 5:1: 20ms CPU compression + 20ms transfer = 40ms latency, 25MB/s effective throughput. Latency improved despite the compression overhead because the time saved on the wire exceeded the CPU time spent compressing.
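The trade-off can be checked with real compression. A sketch using Python's zlib: the 10MB/s link speed is the assumption from the example, and the payload is synthetic, highly compressible data; real compression ratios and CPU costs vary with the data.

```python
import zlib

NETWORK_MBPS = 10.0  # assumed link speed (MB/s), matching the figures above

payload = b"2024-01-01 INFO request served in 50ms\n" * 26_000  # ~1 MB
compressed = zlib.compress(payload, level=6)

transfer_raw = len(payload) / (NETWORK_MBPS * 1e6)      # seconds on the wire
transfer_zip = len(compressed) / (NETWORK_MBPS * 1e6)
print(f"raw: {len(payload) / 1e6:.2f} MB, {transfer_raw * 1000:.0f} ms transfer")
print(f"zip: {len(compressed) / 1e6:.3f} MB, {transfer_zip * 1000:.2f} ms transfer")
# Compression wins when the CPU time it costs is less than the wire time it saves.
```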
Video Streaming: Low latency (live sports): buffers are kept minimal so viewers stay close to real time, but encoding quality may be reduced. High throughput (on-demand): larger buffers add startup latency but sustain higher-bitrate, higher-quality playback.
Latency and throughput are linked by Little's Law, a fundamental result from queuing theory: L = λW, where L is the average number of requests in the system, λ is the arrival rate (throughput), and W is the average time in the system (latency). Rearranged as W = L/λ, it shows that once throughput reaches its ceiling, additional concurrency translates directly into latency: doubling the number of in-flight requests doubles the average wait. In practice, systems have a saturation point beyond which increasing offered load actually decreases throughput while latency explodes, a phenomenon known as performance collapse. This is why performance testing must measure both metrics under realistic load conditions.
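Little's Law lets you solve for any one of the three quantities from the other two. A small sketch, using the saturated API server from the earlier example (400 req/s peak, 500ms latency):

```python
def littles_law(L=None, lam=None, W=None):
    """Solve L = lam * W for the missing quantity: in-flight requests (L),
    throughput (lam, req/s), or latency (W, seconds)."""
    if L is None:
        return lam * W
    if lam is None:
        return L / W
    return L / lam  # solving for W

# 400 req/s at 500 ms latency implies 200 requests in flight:
print(littles_law(lam=400, W=0.5))   # 200.0
# With throughput capped at 400 req/s, 400 in-flight requests
# means average latency has doubled to a full second:
print(littles_law(L=400, lam=400))   # 1.0
```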
To Improve Latency: Use caching, reduce network hops, optimize code paths, use faster hardware, compress data, implement edge computing, reduce serialization overhead.
To Improve Throughput: Add parallelism (more servers, more cores), use batching, implement queue-based decoupling, optimize for concurrency, use faster storage, scale horizontally.
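The parallelism lever above can be sketched with Python's concurrent.futures. Here io_task is a hypothetical stand-in for a 20ms I/O-bound operation (simulated with a sleep); the worker count and task count are arbitrary illustration values.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def io_task(_):
    time.sleep(0.02)   # stand-in for a 20 ms I/O-bound operation

N = 10
start = time.perf_counter()
for i in range(N):
    io_task(i)
serial = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as pool:
    list(pool.map(io_task, range(N)))
parallel = time.perf_counter() - start

# Each task still takes ~20 ms (latency unchanged), but five run at once,
# so wall-clock throughput roughly quintuples.
print(f"serial:   {N / serial:5.1f} tasks/s")
print(f"parallel: {N / parallel:5.1f} tasks/s")
```

This is the throughput side of the trade-off in miniature: per-operation latency is untouched, yet the system completes far more operations per second.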
When Latency is Critical: Real-time applications (trading systems, gaming, live video), user-facing APIs where responsiveness is key, database transactions, authentication services.
When Throughput is Critical: Batch processing (ETL jobs, reporting), data ingestion pipelines, backup systems, offline analytics, message queue processing.